New Arabic Medical Dataset for Diseases Classification

Abstract

The Arabic language suffers from a great shortage of datasets suitable for training deep learning models, and the existing ones cover only general, non-specialized classifications. In this work, we introduce a new Arabic medical dataset, which includes two thousand documents collected from several websites, in addition to the Medical Encyclopedia. The dataset was built for the task of classifying texts into 10 classes of diseases (Blood, Bone, Cardiovascular, Ear, Endocrine, Eye, Gastrointestinal, Immune, Liver, and Nephrological). Experiments on the dataset were performed by fine-tuning three pre-trained models: BERT from Google; Arabert, which is based on BERT and trained with a large corpus; and AraBioNER, which is likewise trained with a corpus.
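As a minimal sketch of the label schema implied by the ten classes above, the following Python snippet builds a name-to-id mapping of the kind a text classifier typically needs. The integer ids, the names `CLASSES`, `LABEL2ID`, `ID2LABEL`, and the two helper functions are assumptions made for illustration; the abstract does not state how the authors encoded their labels.

```python
# The ten disease classes named in the abstract, in the order listed there.
CLASSES = [
    "Blood", "Bone", "Cardiovascular", "Ear", "Endocrine",
    "Eye", "Gastrointestinal", "Immune", "Liver", "Nephrological",
]

# Hypothetical integer encoding; the paper's actual label ids are not given.
LABEL2ID = {name: idx for idx, name in enumerate(CLASSES)}
ID2LABEL = {idx: name for name, idx in LABEL2ID.items()}


def encode_labels(names):
    """Map class names to integer ids, failing loudly on unknown names."""
    unknown = [n for n in names if n not in LABEL2ID]
    if unknown:
        raise ValueError(f"unknown class names: {unknown}")
    return [LABEL2ID[n] for n in names]


def decode_labels(ids):
    """Map integer ids (e.g. a classifier's predicted classes) back to names."""
    return [ID2LABEL[i] for i in ids]
```

A pair of mappings like this is usually what gets handed to a sequence-classification model so that its ten output positions can be read back as disease categories.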

Similar resources

A Dataset for Arabic Textual Entailment

There are fewer resources for textual entailment (TE) for Arabic than for other languages, and the manpower for constructing such a resource is hard to come by. We describe here a semi-automatic technique for creating a first dataset for TE systems for Arabic using an extension of the ‘headline-lead paragraph’ technique. We also sketch the difficulties inherent in volunteer annotators-based jud...

ASTD: Arabic Sentiment Tweets Dataset

This paper introduces ASTD, an Arabic social sentiment analysis dataset gathered from Twitter. It consists of about 10,000 tweets which are classified as objective, subjective positive, subjective negative, and subjective mixed. We present the properties and the statistics of the dataset, and run experiments using standard partitioning of the dataset. Our experiments provide benchmark results f...

The Arabic Online Commentary Dataset: an Annotated Dataset of Informal Arabic with High Dialectal Content

The written form of Arabic, Modern Standard Arabic (MSA), differs quite a bit from the spoken dialects of Arabic, which are the true “native” languages of Arabic speakers used in daily life. However, due to MSA’s prevalence in written form, almost all Arabic datasets have predominantly MSA content. We present the Arabic Online Commentary Dataset, a 52M-word monolingual dataset rich in dialectal...

The Enron Corpus: A New Dataset for Email Classification Research

Automated classification of email messages into user-specific folders and information extraction from chronologically ordered email streams have become interesting areas in text learning research. However, the lack of large benchmark collections has been an obstacle for studying the problems and evaluating the solutions. In this paper, we introduce the Enron corpus as a new test bed. We analyze...

Improved Classification of Medical Universities in Iran, a New Approach

Background: In order to check the practicality of classification of Universities of Medical Sciences (UMSs) based on their infrastructures, and scientific contributions, this study explored the most appropriate indicators to measure the size and productivity of UMSs. Methods: In the first phase, we approached a group of experts who had a deep experience in the management of UMSs and in the mini...

Journal

Journal title: Lecture Notes in Computer Science

Year: 2021

ISSN: 1611-3349, 0302-9743

DOI: https://doi.org/10.1007/978-3-030-91608-4_20